Abstract

Software security vulnerabilities are a major threat to software
systems. In the worst case, vulnerabilities enable users to gain
unauthorized access to or unauthorized control of an application.
A large number of software security exploits, such as buffer
overflows, SQL injections, and cross-site scripting attacks, are
caused by data flowing from untrusted program input sources into
sensitive program functions. We define a tainted path as a program
execution path from an untrusted program input source into a
sensitive program location. This paper presents a static taint
analysis that computes tainted paths in C programs and does not
require any program annotations. Our static taint analysis algorithm
is built upon the iterative dataflow framework and has been
implemented in the tool SAINT (Simple Static Taint Analysis Tool).
Our static taint analysis is interprocedural and flow-sensitive, and
developers can choose to run it either with or without context
sensitivity. We have implemented our analysis using the LLVM
compiler infrastructure.

General Terms Software Security Vulnerabilities, Program Analysis,
Static Code Analysis, Static Taint Analysis

Keywords tainted path, static taint analysis, static code analysis 

1. Introduction 

Software vulnerabilities are security threats that exist in an
application. They allow malevolent users to exercise unauthorized
control of the application through supplied input. There are several
kinds of software vulnerabilities: buffer overflows, format string
attacks, SQL injection, etc. Researchers have worked on dynamic,
static [1, 4, 5, 9, 10, 15, 22, 25], and hybrid [24] techniques to
find security vulnerabilities in software.

This paper introduces the concept of a tainted path. A tainted path
is a program execution path from a program input source into a
sensitive program location. A tainted path represents a software
vulnerability. This paper presents a static taint analysis that
computes tainted data and tainted paths in C programs. Our
implementation of the taint analysis uses the LLVM framework and
does not require user annotations. Our static taint analysis is
flow-sensitive and interprocedural, and developers can choose to run
it with or without context sensitivity. In taint analysis, a source
is a program location that allows a value from the environment into
the program. This may occur through the return value of a system
call, user input, etc. A value from the program environment that has
not been validated and sanitized is called a tainted value. A sink
is a program location that uses a tainted value.

Data validation is the process of checking that data has the
expected form, for instance checking that a string input has the
format of an email address. Data sanitization is the process of
ensuring that validated data is safe in a particular context, for
instance escaping string input before using it in a SQL query.
A function that sanitizes an application's external input is called
a sanitizer. Once a value has been sanitized, it is tagged as not
tainted. In the following, we assume that sanitized data has been
validated.
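
The distinction between validation and sanitization can be sketched in C as follows. This is only an illustration under stated assumptions: the function names is_valid_digits and sanitize_sql_quotes are hypothetical and are not part of SAINT, and doubling single quotes stands in for whatever escaping a given SQL context actually requires.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Validation: check that the input has the expected form
 * (here: a non-empty string of decimal digits). */
int is_valid_digits(const char *s) {
    if (s == NULL || *s == '\0') return 0;
    for (; *s != '\0'; ++s) {
        if (!isdigit((unsigned char)*s)) return 0;
    }
    return 1;
}

/* Sanitization: make the input safe for one specific context
 * (here: double every single quote before the string is embedded
 * in a SQL string literal). Returns the number of bytes written,
 * excluding the terminator, or -1 if the output buffer is too small. */
int sanitize_sql_quotes(const char *in, char *out, size_t out_size) {
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (; *in != '\0'; ++in) {
        size_t need = (*in == '\'') ? 2 : 1;
        /* Keep one byte in reserve for the terminating '\0'. */
        if (n + need + 1 > out_size) return -1;
        if (*in == '\'') out[n++] = '\'';
        out[n++] = *in;
    }
    if (n + 1 > out_size) return -1;
    out[n] = '\0';
    return (int)n;
}
```

In taint-analysis terms, a call to a function like sanitize_sql_quotes is the point at which a value stops being tagged as tainted, while is_valid_digits alone only checks the form of the data.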

Static taint analysis searches for tainted values and warns
developers about each one, so that they can validate and sanitize it
and avoid software vulnerability exploits at runtime. Taint analysis
proceeds by first tagging values from sources as tainted. Once
tagged, the tainted values are propagated through the entire
program.

Taint propagation is the process of marking values as tainted if
they result from an operation that involves tainted data. This can
be an arithmetic operation (addition, multiplication, etc.), an
assignment, or another type of program instruction. Finally, a taint
analysis emits a warning whenever a tainted value is used at a sink
location. Taint propagation can be data-flow or control-flow based.
Data-flow based taint propagation arises from data dependencies in
the program (e.g., assigning the value of a tainted variable s_u to
another variable s_d). Control-flow based taint propagation arises
from control dependencies (e.g., if a tainted variable s_t is used
in a branch condition, values produced by instructions inside that
branch become tainted). Data-flow based taint propagation is also
called explicit taint propagation, and control-flow based taint
propagation is called implicit taint propagation. Our static taint
analysis searches for tainted paths and implements both control-flow
and data-flow based taint propagation.
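
The two propagation rules can be illustrated with a small C sketch. It models taint with a runtime flag purely for exposition; SAINT itself is a static analysis and does not work this way. The type tval and the functions tv_add and tv_select are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tracked value: a payload plus a taint flag. */
typedef struct {
    int  value;
    bool tainted;
} tval;

/* Explicit (data-flow) propagation: the result of an operation is
 * tainted if any of its operands is tainted. */
tval tv_add(tval a, tval b) {
    tval r = { a.value + b.value, a.tainted || b.tainted };
    return r;
}

/* Implicit (control-flow) propagation: a value chosen under a branch
 * condition is tainted if the condition is tainted, even when both
 * candidate values are clean. */
tval tv_select(tval cond, tval if_true, tval if_false) {
    tval r = cond.value ? if_true : if_false;
    r.tainted = r.tainted || cond.tainted;
    return r;
}
```

With a clean value c and a tainted value d, tv_add(c, d) is tainted through a data dependency, whereas tv_select(d, c, c) is tainted only because the choice between the two clean values depends on d.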

This paper makes the following contributions: 

• It introduces the concept of tainted path, which is a program 
execution path from a taint source to a taint sink. 

• It shows that several static analysis problems can be reduced to 
a tainted path computation problem. 

• It describes SAINT, a whole-program static taint analysis that
is flow-sensitive, interprocedural, and context-sensitive. SAINT
computes tainted paths in C programs and is available for
download at https://github.com/xaviernoumbis/saint



Abstract instruction   Description         Code in C          Formal description
ALLOC                  Memory allocation   v = malloc(...)    v ∈ T
COPY                   Copy instruction    p = q              p, q ∈ T
LOAD                   Load instruction    p = *q             p ∈ T, q ∈ A
STORE                  Store instruction   *p = q             p ∈ A, q ∈ T
CALL                   Call instruction    r = call func(p)   r ∈ T

Table 1: LLVM abstract instruction types. In the LLVM intermediate
representation, A and T denote the set of address-taken variables
and the set of top-level variables, respectively.



SAINT's taint analysis is sound (i.e., all tainted paths are
reported).

To the best of our knowledge, SAINT is the only tool that
implements static taint analysis for a static language using the
iterative dataflow framework. See Table 7 in Section 6.

The paper is organized as follows: Section 2 introduces the concept
of tainted path. Section 3 introduces our running example, and
Section 4 gives an overview of the LLVM intermediate representation.
Section 5 presents the taint analysis algorithm, and Section 6
presents experimental results. Finally, Section 7 discusses related
work and Section 8 concludes.



2. Tainted Paths

In this section we introduce the term tainted path. We define a
tainted path as a program execution path from a taint source to a
taint sink. Let us consider the three-address code in Figure 1. We
represent a line of code with Lx, where L means line and x is a
line number. The program path <L4, L7, L9, L10, L11> defines a
tainted path, while the program paths <L4, L7, L9, L10> and
<L4, L7, L8> do not define tainted paths. We postulate that several
static analysis problems can be reduced to a combination of tainted
path computation and test data generation. For instance, this is the
case for the following static analysis problems:

• Buffer-overflow detection.

• SQL injection vulnerability detection.

• Format string vulnerability detection.

• Automatic test cases and test data generation.

          void mysql_taint(uint); //taint sink

          uint calculate(uint x) {
    L1:     uint sum = 0;
    L2:     uint i = 0;
    L3:     if (x == 2)
    L4:       scanf("%d", &sum); //taint source
    L5:     else
    L6:       sum = 0;
    L7:     if (i >= x)
    L8:       goto L13;
    L9:     sum = sum + i;
    L10:    i = i + 1;
    L11:    mysql_taint(i);
    L12:    goto L7;
    L13:    return sum;
          }

Figure 1: Code example in three-address format

2.1 Buffer-overflow detection

2.2 Automatic test cases and test data generation

Given a tainted path, one can combine symbolic execution and
constraint solving to generate the data needed to execute the
tainted path. The generated data, along with the given tainted path,
represents a test case (or an exploit). Developers can then change
the code that causes the bug (or vulnerability).

3. Motivating Example

     1  int func_sql(int); //sink
     2
     3  int compute(int x) {
     4    int sum = -1;
     5    if (x == 2)
     6      sum = func_sql(x);
     7    return sum;
     8  }
     9
    10  int main() {
    11    int x, y;
    12    scanf("%d", &x);
    13    y = compute(x);
    14    return 0;
    15  }

Figure 2: Motivating example

This paper uses an example inspired by the example described in [3].
Figure 2 shows two functions: main and compute. In main, the
function scanf from the C standard input/output library gets an
integer input from the user at line 3 and stores it in variable x.
x then becomes tainted because it holds a value from the environment
that has not been validated and sanitized. x is later used as the
argument to function compute at line 4. In compute, variable sum
gets tainted at line 10 through function scanf. The formal parameter
x gets tainted if and only if a tainted actual argument was passed
at a call site. This is for instance the case at line 4 in function
main. Observe that if a tainted argument x is used at the call site
at line 4 in function main, this leads to control-flow based taint
propagation at lines 10 and 11 of function compute. Variable sum
also becomes tainted at line 11 because the statement at line 11 is
control-dependent on the conditional expression at line 10.

4. LLVM

This section gives an overview of LLVM¹ (Low Level Virtual Machine)
and its intermediate representation (IR), which we use as



1 http://llvm.org 



Abstract instruction   Code in C            GEN-SET                                                          KILL-SET
ALLOC                  s: v = malloc(...)   ∅                                                                ∅
COPY                   s: p = q             {p} iff q ∈ IN[s]                                                ∅
LOAD                   s: p = *q            {t_j | t_j = toplevel(a_j) ∧ a_j ∈ points_to(q) ∧ t_j ∈ IN[s]}   ∅
STORE                  s: *p = q            {t_j | t_j = toplevel(a_j) ∧ a_j ∈ points_to(p)} iff q ∈ IN[s]   ∅

Table 2: Gen- and kill-sets for the abstract instructions ALLOC, COPY, LOAD, and STORE



basis for the description of our taint analysis. LLVM is a compiler
framework that contains several components and libraries that help
developers build compilers and compiler tools (e.g., static
analyses). LLVM primarily processes source code written in C, C++,
and Objective-C. The LLVM libraries are written in C++. Table 1
shows the abstract instruction types we consider in the LLVM IR
for our analysis. In the following, we present the LLVM intermediate
representation. Our presentation is based on the descriptions
given by Hardekopf et al. [8] and by Lhoták et al. [14]. LLVM's IR
uses partial static single assignment (partial SSA) and assumes the
existence of two types of variables in C code: top-level variables
and address-taken variables.

4.1 Top-level variables 

Top-level variables are variables that are never accessed via a
pointer in the program code. LLVM converts top-level variables
into SSA form when building the LLVM IR. The memory address of a
top-level variable is never copied to another variable (i.e., the
address-of operator (&) of the C programming language is never
applied to it). In the LLVM IR, top-level variables are only
accessed using ALLOC and COPY instructions. This paper denotes the
set of top-level variables with T. In Figure 2, 61 and 62 are
top-level variables ({61, 62} ∈ T).



5.1 Taint sources and taint sinks 

Program statements that initially taint variables (taint sources)
are discovered during the intraprocedural analysis, described later
in this section. By default, the analysis treats a subset of the C
standard library as taint sources: getc, scanf, gets, fopen, etc.
Functions that use tainted variables (taint sinks) are gradually
discovered during the various phases of the analysis. SAINT has
a configuration file where developers can register additional taint
source and taint sink functions.
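
A lookup of this kind might be sketched as follows. The sketch assumes that sources are identified by callee name only; the function is_taint_source, its parameters, and the in-memory table are hypothetical, and the format of SAINT's actual configuration file is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default taint sources named in the text; this list is illustrative
 * and not exhaustive. */
static const char *const default_sources[] = {
    "getc", "scanf", "gets", "fopen"
};

/* Hypothetical helper: linear search for a name in a table of names. */
static int name_in(const char *name, const char *const *table, size_t n) {
    assert(name != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(name, table[i]) == 0) return 1;
    }
    return 0;
}

/* A callee is a taint source if it is a default source or if the
 * developer registered it as an extra source (assumed to have been
 * read from a configuration file elsewhere). */
int is_taint_source(const char *callee,
                    const char *const *extra, size_t n_extra) {
    size_t n_default = sizeof default_sources / sizeof default_sources[0];
    return name_in(callee, default_sources, n_default)
        || name_in(callee, extra, n_extra);
}
```

The same shape would work for sinks; keeping defaults and user-registered names in separate tables lets the defaults stay constant while the extra table grows.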

5.2 Taint propagation 

SAINT performs explicit and implicit taint propagation (data-flow
and control-flow taint propagation). Explicit taint propagation
tracks variables that are tainted due to assignment statements. The
assignment in line 6 of Figure 2 is an instance of explicit taint
propagation. Variable y becomes tainted since it is assigned the
return value of compute, which is a tainted value.

Implicit taint propagation takes into account tainted variables used
in control conditions. In Figure 2 for instance, variable x is
passed to the function compute as a tainted actual parameter during
the context-sensitive analysis. This implies that variables sum and
i become tainted at line 15 of Figure 2 because x is tainted and is
part of the for-loop boolean condition at line 14.



4.2 Address-taken variables 

Address-taken variables are never accessed directly through their
declared name. They are only accessed indirectly, through pointer
variables and LOAD and STORE instructions. In fact, address-taken
variables are those to which the address-of operator (&) has been
applied. This paper uses A for the set of all address-taken
variables. Variable x in Figure 2 is for instance an address-taken
variable (x ∈ A).
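
The difference between the two kinds of variables can be seen in a few lines of C. This is a minimal sketch: set_to and classify_demo are hypothetical names, and the comments describe the partial-SSA view given in this section rather than the exact IR that any particular compiler version or optimization level emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper that writes through a pointer (a STORE: *p = q). */
static void set_to(int *p, int q) {
    assert(p != NULL);
    *p = q;
}

int classify_demo(int input) {
    int a = input;  /* top-level: '&a' never appears in the program */
    int b = 0;      /* address-taken: '&b' is passed to set_to below */
    set_to(&b, a);  /* b is modified only indirectly, via the pointer */
    return a + b;   /* reading b corresponds to a LOAD in this view */
}
```

Here a would belong to T and b to A, in the same way that x in Figure 2 belongs to A because its address is passed to scanf.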

4.3 Representation of program expressions 

5. Staged Static Taint Analysis 

Our taint analysis is interprocedural and runs either
context-insensitively or context-sensitively. Either form of the
interprocedural analysis is always preceded by an intraprocedural
analysis that computes initial taint information, which is reused by
the interprocedural analyses. The intraprocedural analysis detects
taint sources and initializes a summary table that contains taint
information about the formal parameters and return value of each
program function. The summary table allows fast access to key
information about program procedures, which is especially useful
during the subsequent interprocedural phases. For instance, the
intraprocedural analysis would detect that variable sum of procedure
compute in Figure 2, which also holds the return value of compute,
may be tainted due to the call to scanf at line 12. Figure 3 shows
the architecture of our taint analysis SAINT
and Table 2 shows the transfer functions for the abstract program
statements ALLOC, COPY, LOAD, STORE, and CALL. These transfer
functions apply to both the intraprocedural and the interprocedural
analyses.
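
One possible shape for such a summary table is sketched below. The layout is an assumption made for illustration: the struct fn_summary, the bound MAX_PARAMS, and the functions find_summary and bind_argument are hypothetical and do not describe SAINT's internal data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARAMS 4

/* Hypothetical per-function summary: taint of the formal parameters
 * and of the return value. */
typedef struct {
    const char *name;
    bool param_tainted[MAX_PARAMS];
    bool returns_tainted;
} fn_summary;

/* Look a function up by name; returns NULL when no summary exists. */
fn_summary *find_summary(fn_summary *table, size_t n, const char *name) {
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(table[i].name, name) == 0) return &table[i];
    }
    return NULL;
}

/* At a call site, merge the taint of an actual argument into the
 * callee's formal parameter. Returns true if the summary changed,
 * which would tell an interprocedural pass to revisit the callee. */
bool bind_argument(fn_summary *callee, size_t index, bool actual_tainted) {
    assert(callee != NULL && index < MAX_PARAMS);
    if (actual_tainted && !callee->param_tainted[index]) {
        callee->param_tainted[index] = true;
        return true;
    }
    return false;
}
```

Because bind_argument only ever switches a flag from clean to tainted, repeated passes over the call sites cannot oscillate, which is the property an iterative analysis needs in order to terminate.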



5.3 Sanitizers 

Sanitizers are functions that developers use to make sure that
tainted data are safe to use in sensitive (or vulnerable) program
functions. SAINT creates kill-sets whenever a sanitizer is found on
a tainted path.
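
The role of gen- and kill-sets can be sketched with bit sets, using the usual dataflow equation OUT = GEN ∪ (IN \ KILL). The COPY rule follows Table 2; the treatment of a sanitizer call as a statement whose result is placed in the kill-set is an assumption of this sketch, and the names taint_set, transfer, transfer_copy, and transfer_sanitize are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per program variable; bit i set means that variable i may
 * be tainted. */
typedef uint32_t taint_set;

static taint_set bit(unsigned var) {
    assert(var < 32);
    return (taint_set)1u << var;
}

/* Standard dataflow transfer function: OUT = GEN | (IN & ~KILL). */
taint_set transfer(taint_set in, taint_set gen, taint_set kill) {
    return gen | (in & ~kill);
}

/* COPY s: p = q, following Table 2: GEN = {p} iff q is in IN[s],
 * and the kill-set is empty. */
taint_set transfer_copy(taint_set in, unsigned p, unsigned q) {
    taint_set gen = (in & bit(q)) ? bit(p) : 0;
    return transfer(in, gen, 0);
}

/* Assumed model of a sanitizer call whose result is p: p is placed
 * in the kill-set, so p leaves the statement untainted. */
taint_set transfer_sanitize(taint_set in, unsigned p) {
    return transfer(in, 0, bit(p));
}
```

For example, with variable 0 tainted on entry, copying it into variable 1 yields a set containing both, and sanitizing variable 1 afterwards removes only that variable from the set.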

Figure 3: SAINT analysis architecture



5.4 Formalisms 

This paper uses the following elements to describe the taint
analysis as an instance of the iterative dataflow analysis framework:






